Lessons learned in web scraping


In [1]:
import requests

Making requests

Lesson 1 - Use the head!

A HEAD request is a lighter, faster way to check whether a URL is reachable, whether a given file exists, and so on, because the server sends back only the headers and no body. The Last-Modified and Content-Length headers are especially useful.


In [2]:
# HEAD: the server returns the status line and headers only, no body
resp = requests.head('http://www.google.com')
print(resp.headers['Content-Length'])


258

In [3]:
# GET: the response body is downloaded as well
resp = requests.get('http://www.google.com')
print(resp.headers['Content-Length'])


4539
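
To make the HEAD check reusable, something like the sketch below could work. The names `summarize_head` and `probe` are made up for illustration, and it assumes the server actually sends Content-Length and Last-Modified (many don't, so both fall back to None). Note that `requests.head` does not follow redirects unless `allow_redirects=True` is passed.

```python
from email.utils import parsedate_to_datetime


def summarize_head(status_code, headers):
    """Turn a HEAD response's status and headers into a small summary dict."""
    length = headers.get("Content-Length")
    modified = headers.get("Last-Modified")
    return {
        # Treat any 2xx/3xx status as "the resource is there".
        "ok": 200 <= status_code < 400,
        "size_bytes": int(length) if length is not None else None,
        # Last-Modified uses the HTTP date format, which this parser understands.
        "last_modified": parsedate_to_datetime(modified) if modified else None,
    }


def probe(url, timeout=10):
    """Send a HEAD request and summarize it (hypothetical helper)."""
    import requests  # third-party; imported here so summarize_head needs only the stdlib

    resp = requests.head(url, allow_redirects=True, timeout=timeout)
    return summarize_head(resp.status_code, resp.headers)
```

With this, `probe(url)['ok']` answers "does it exist?" and `probe(url)['size_bytes']` tells you how big the download would be before you commit to it.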

Lesson 2 - Big file? Download async

Show benchmarks
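
One possible way to do this, sketched with a thread pool rather than true asyncio: stream each file to disk in chunks and run several downloads concurrently. `download` and `download_all` are hypothetical names, the chunk size and worker count are arbitrary, and file names are assumed to be unique per URL. No timings are claimed here.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from urllib.parse import urlsplit


def download(url, dest, chunk_size=64 * 1024, timeout=30):
    """Stream one URL to disk in chunks so the whole body never sits in memory."""
    import requests  # third-party; imported here so the rest runs without it

    with requests.get(url, stream=True, timeout=timeout) as resp:
        resp.raise_for_status()
        with open(dest, "wb") as fh:
            for chunk in resp.iter_content(chunk_size=chunk_size):
                fh.write(chunk)
    return dest


def download_all(urls, dest_dir, fetch=download, max_workers=4):
    """Download several URLs concurrently; returns destination paths in input order."""
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    # Name each file after the last path segment of its URL (assumed unique here).
    targets = [dest_dir / (Path(urlsplit(u).path).name or "index.html") for u in urls]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls, targets))
```

The `fetch` parameter is there so the download function can be swapped out, for example with a timed wrapper when benchmarking.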

Extraction

Lesson 1 - RegEx patterns or XPath queries
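
A small side-by-side sketch of the two approaches, extracting link targets from a made-up snippet. It uses the standard library's `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup; real-world HTML usually calls for a more forgiving parser. `PAGE`, `links_by_regex` and `links_by_xpath` are illustrative names.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed snippet standing in for a scraped page.
PAGE = """<html><body>
  <a href="/first">First</a>
  <p>No link here</p>
  <a class="ext" href="http://example.com/second">Second</a>
</body></html>"""


def links_by_regex(markup):
    """Pull href values with a pattern; quick, but blind to document structure."""
    return re.findall(r'<a\b[^>]*\bhref="([^"]*)"', markup)


def links_by_xpath(markup):
    """Pull href values by querying the parsed tree with an XPath-style expression."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(".//a[@href]")]
```

Both functions return the same two links for this snippet. The regex version breaks as soon as the attribute quoting or layout changes, while the XPath version only depends on the element structure.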

Lesson 2 - Is there any pattern out there?

